NSF PAR Search | NSF Public Access Repository

NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing

https://doi.org/10.1109/HPCA61900.2025.00072

Fariborz, Marjan; Samani, Mahyar; York, Austin; Ben Yoo, SJ; Lowe-Power, Jason; Akella, Venkatesh (March 2025, IEEE)

Free, publicly-accessible full text available March 1, 2026
Efficient Caching with A Tag-enhanced DRAM

https://doi.org/10.1109/HPCA61900.2025.00062

Babaie, Maryam; Akram, Ayaz; Elsasser, Wendy; Haukness, Brent; Miller, Michael R; Song, Taeksang; Vogelsang, Thomas; Woo, Steven C; Lowe-Power, Jason (March 2025, IEEE)

Free, publicly-accessible full text available March 1, 2026
CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems

Hildebrand, Mark; Lowe-Power, Jason; Akella, Venkatesh (May 2024, 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Full Text Available
TDRAM: Tag-enhanced DRAM for Efficient Caching

Babaie, Maryam; Akram, Ayaz; Elsasser, Wendy; Haukness, Brent; Miller, Michael; Song, Taeksang; Vogelsang, Thomas; Woo, Steven; Lowe-Power, Jason (April 2024, arxiv)

Full Text Available
Efficient Large Scale DLRM Implementation On Heterogeneous Memory Systems

https://doi.org/10.1007/978-3-031-32041-5_3

Hildebrand, Mark; Lowe-Power, Jason; Akella, Venkatesh (May 2023, High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings)

We propose a new data structure called CachedEmbeddings for training large scale deep learning recommendation models (DLRM) efficiently on heterogeneous (DRAM + non-volatile) memory platforms. CachedEmbeddings implements an implicit software-managed cache and data movement optimization, integrated with the Julia programming framework, to optimize large scale DLRM implementations with multiple sparse embedded table operations. In particular, we show an implementation that is 1.4X to 2X better than the best known Intel CPU based implementations on state-of-the-art DLRM benchmarks on a real heterogeneous memory platform from Intel, and a 1.32X to 1.45X improvement over Intel’s 2LM implementation, which treats the DRAM as a hardware managed cache.
Full Text Available
SoK: Limitations of Confidential Computing via TEEs for High-Performance Compute Systems

https://doi.org/10.1109/SEED55351.2022.00018

Akram, Ayaz; Akella, Venkatesh; Peisert, Sean; Lowe-Power, Jason (September 2022, 2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED))

Full Text Available
Enabling Design Space Exploration for RISC-V Secure Compute Environments

Akram, Ayaz; Akella, Venkatesh; Peisert, Sean; Lowe-Power, Jason (June 2021, Fifth Workshop on Computer Architecture Research with RISC-V)

Cycle-level architectural simulation of Trusted Execution Environments (TEEs) can enable extensive design space exploration of these secure architectures. Existing architectural simulators that support TEEs are either based on hardware-level implementations or abstract analytic models. In this paper, we describe the implementation of the gem5 models necessary to run and evaluate the RISC-V-based open source TEE, Keystone, and we discuss how this simulation environment opens new avenues for designing and studying these trusted environments. We show that the Keystone simulations on gem5 exhibit performance similar to that of the previous hardware evaluations of Keystone. We also describe three simple example use cases (understanding the reason for trusted execution slowdown, performance of memory encryption, and micro-architecture impact on trusted execution performance) to demonstrate how the ability to simulate TEEs can provide useful information about their behavior in their existing form and also with enhanced designs.
Full Text Available
A Case Against Hardware Managed DRAM Caches for NVRAM Based Systems

https://doi.org/10.1109/ISPASS51385.2021.00036

Hildebrand, Mark; Angeles, Julian T.; Lowe-Power, Jason; Akella, Venkatesh (March 2021, 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS))
Non-volatile memory (NVRAM) based on phase-change memory (such as Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications that have a huge memory footprint. These systems have both DRAM and NVRAM on the same memory channel, with the smaller capacity DRAM serving as a cache to the larger capacity NVRAM in the so-called 2LM mode. In this work, we analyze the performance of such DRAM caches on real hardware using a broad range of synthetic and real-world benchmarks. We identify three key limitations of DRAM caches in these emerging systems that prevent large-scale, bandwidth-bound applications from taking full advantage of NVRAM read and write bandwidth. We show that software-based techniques are necessary for orchestrating the data movement between DRAM and PMM for such workloads to take full advantage of these new heterogeneous memory systems.
Full Text Available
Stream Floating: Enabling Proactive and Decentralized Cache Optimizations

https://doi.org/10.1109/HPCA51647.2021.00060

Wang, Zhengrong; Weng, Jian; Lowe-Power, Jason; Gaur, Jayesh; Nowatzki, Tony (February 2021, 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA))
As multicore systems continue to grow in scale and on-chip memory capacity, the on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache line granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory access can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to caches. We evaluate using a cycle-level execution-driven gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedup over an in-order and OOO core with a state-of-the-art prefetcher design, respectively, with 64% and 49% energy efficiency advantage.
Full Text Available
Performance Analysis of Scientific Computing Workloads on General Purpose TEEs

Akram, Ayaz; Giannakou, Anna; Akella, Venkatesh; Lowe-Power, Jason; Peisert, Sean (January 2021, 35th IEEE International Parallel & Distributed Processing Symposium)
Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impact of general purpose TEEs (AMD SEV and Intel SGX) for diverse HPC benchmarks including traditional scientific computing, machine learning, graph analytics, and emerging scientific computing workloads. We observe three main findings: 1) SEV requires careful memory placement on large scale NUMA machines (1×–3.4× slowdown without and 1×–1.15× slowdown with NUMA aware placement), 2) virtualization, a prerequisite for SEV, results in performance degradation for workloads with irregular memory accesses and large working sets (1×–4× slowdown compared to native execution for graph applications), and 3) SGX is inappropriate for HPC given its limited secure memory size and inflexible programming model (1.2×–126× slowdown over unsecure execution). Finally, we discuss forthcoming new TEE designs and their potential impact on scientific computing.
Full Text Available
